knitr::opts_chunk$set(cache = TRUE)

Preface

This R notebook was originally written as a final project for CMSC320, a course in data science at the University of Maryland. Since then, I have continued to use this notebook to explore other things on my own and to take the project beyond the initial specifications given in class. In this notebook, you can expect to find examples of data visualization, data scrubbing, exploratory analysis, text mining, and machine learning. Of course, there is much more to data science (let alone computer science) than just these topics, but the goal of this notebook is to demonstrate my thinking process while creating something original. I will identify a problem and build a model that helps address it. Skip straight to Part 5 if that is what you would like to see.

Data Analysis Project: InsideAirbnb data

Part 1: Background and motivation

Motivation: Airbnb is a temporary housing service that has taken off in recent years in competition with the hotel and rental housing industries. Importantly, it drastically reduces costs for guests. However, Airbnb is still largely unregulated (as of 2017-2018). In large American cities like Chicago, a landlord is usually legally obligated to inform renters of certain aspects of the property, up to and including safety hazards and a history of building code violations, and these regulations are enforced by the city. I am not particularly familiar with Airbnb’s terms of service, but I do not imagine they are as strict about this type of thing, nor do I believe Airbnb has the same resources as a city to properly vet properties and hosts or enforce its own rules. The other temporary housing options are typically vetted by some governmental body and tend to have a larger base of consumer reviews.

Thus, the question we aim to answer through our data analysis is a very broad one: if Airbnb is a very cheap but relatively unregulated temporary living option, how do I, as a prospective guest, choose the right place to stay, and reduce whatever risks come with using Airbnb, when visiting a major US city like Chicago on the cheap?

Part 2: An exploratory data analysis; Chicago

library(tidyverse)
library(stringr)
library(ggplot2)

#Load in the listing data, and select only the fields which seem useful
whole_listings_tab <- read_csv("listings.csv")
#whole_listings_tab
part_listings_tab <- whole_listings_tab %>% 
  select(id, host_id, host_since, host_response_time, host_response_rate,
    host_is_superhost, host_listings_count, host_identity_verified, property_type,
    price, minimum_nights, maximum_nights, availability_30, availability_60,
    availability_90, availability_365, number_of_reviews, review_scores_rating,
    review_scores_accuracy, review_scores_cleanliness, review_scores_checkin,
    review_scores_communication, review_scores_value, reviews_per_month, license,
    room_type, bed_type, longitude, latitude, zipcode, neighbourhood_cleansed,
    house_rules, interaction, description, neighborhood_overview, space, summary, host_about)
part_listings_tab
## # A tibble: 5,207 x 38
##          id  host_id host_since host_response_time host_response_rate
##       <int>    <int> <date>     <chr>              <chr>             
##  1 13824783 55020055 2016-01-17 within an hour     100%              
##  2 16740225 36722941 2015-06-25 within an hour     100%              
##  3 18125245 51669215 2015-12-18 within an hour     100%              
##  4  8362570 32837114 2015-05-07 within a few hours 100%              
##  5   789867  2782694 2012-06-29 within an hour     100%              
##  6 16701336 36722941 2015-06-25 within an hour     100%              
##  7  8148031 43018961 2015-08-31 within an hour     100%              
##  8  6713059  2441740 2012-05-22 N/A                N/A               
##  9 17620535   756599 2011-06-29 within an hour     100%              
## 10 12581064 23503681 2014-11-08 within an hour     100%              
## # ... with 5,197 more rows, and 33 more variables

Preprocessing: host_since, host_about, license, host_is_superhost, host_identity_verified, price, host_response_time

For the data analysis, we only care about the length of time someone has been a host. We will also, for now, only consider whether a host has an about section (though we will do some text mining on the text fields later on). Similarly, we aren’t going to verify the validity of any of the license numbers, so a boolean column makes much more sense. In general, we will also clean up columns that contain true and false values but are currently encoded as characters (host_is_superhost and host_identity_verified) and columns that represent numeric values but are encoded as characters (price, host_response_rate).

analysis_tab <- part_listings_tab %>%
  mutate(response_rate = as.double(gsub("%", "", host_response_rate))) %>%
  mutate(response_time = factor(host_response_time, levels = c("N/A", "a few days or more", "within a day", "within a few hours", "within an hour"))) %>%
  mutate(cost = as.double(gsub("[$,]", "", price))) %>%
  mutate(is_superhost = ifelse(host_is_superhost == "t", TRUE, FALSE)) %>%
  mutate(identity_verified = ifelse(host_identity_verified == "t", TRUE, FALSE)) %>%
  mutate(has_license = ifelse(str_detect(license, 
    "[0-9][0-9][0-9][0-9][0-9][0-9][0-9]"), TRUE, FALSE)) %>%
  mutate(has_about = ifelse(is.na(host_about), FALSE, TRUE)) %>%
  mutate(years_as_host = as.numeric(difftime(as.Date("2017-05-17"),
    as.Date(host_since), units = "weeks"))/52.25) %>%
  select(id, host_id, response_rate, response_time, is_superhost, host_listings_count,
    identity_verified, property_type, cost, minimum_nights, maximum_nights,
    availability_30, availability_60, availability_90, availability_365,
    number_of_reviews, reviews_per_month, review_scores_rating, review_scores_accuracy,
    review_scores_cleanliness, review_scores_checkin, review_scores_communication,
    review_scores_value, has_license, has_about, years_as_host,
    room_type, bed_type, longitude, latitude, zipcode, neighbourhood_cleansed, host_about,
    house_rules, interaction, description, neighborhood_overview, space, summary)
analysis_tab
## # A tibble: 5,207 x 39
##          id  host_id response_rate response_time      is_superhost
##       <int>    <int>         <dbl> <fct>              <lgl>       
##  1 13824783 55020055           100 within an hour     T           
##  2 16740225 36722941           100 within an hour     F           
##  3 18125245 51669215           100 within an hour     F           
##  4  8362570 32837114           100 within a few hours T           
##  5   789867  2782694           100 within an hour     F           
##  6 16701336 36722941           100 within an hour     F           
##  7  8148031 43018961           100 within an hour     F           
##  8  6713059  2441740            NA N/A                F           
##  9 17620535   756599           100 within an hour     F           
## 10 12581064 23503681           100 within an hour     T           
## # ... with 5,197 more rows, and 34 more variables

Point 1: Airbnb is highly unregulated.

This particular point provides almost the entire motivation behind our analysis. Let’s see if this is the case with a visualization.

ggplot(data=analysis_tab, aes(x=factor(has_license))) +
  geom_bar(stat="count") +
  labs(x = "Has a license")

As we can see, very, very few hosts in the Chicago area on Airbnb have gotten a guest property or vacation house license for their property. Of course, many of these listings will be for apartments, but many apartment leases prohibit short term rentals or sublets without the permission of the landlord to begin with. In general, this all aligns with the idea that it is hard for prospective guests to be sure that the proper regulations are being followed. Let’s move on to the reviews.

Point 2: Airbnb has inflated review scores.

ggplot(data=analysis_tab, aes(x=review_scores_rating)) +
  geom_density() + 
  labs(x = "Review scores")

num_obs = nrow(analysis_tab)
num_obs_above_80 = nrow(subset(analysis_tab, review_scores_rating >= 80))

print(paste("Percentage of listings with a rating above 80:", num_obs_above_80/num_obs*100))
## [1] "Percentage of listings with a rating above 80: 84.3479930862301"

Raw review scores can be hard to interpret when about 84 percent of listings have a review score that lies at or above 80 out of 100, especially in the context of making a booking. A review score of 80 sounds good enough on the surface, but what we are really after is picking the best available host, so it could be a little misleading. This gives us the motivation to standardize the review score, allowing us to see more clearly which listings are reviewed better than others.
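To make this concrete, the empirical CDF reports exactly what fraction of listings a given raw score beats. A minimal sketch on toy numbers (illustrative data, not the real listings; the notebook itself would pass analysis_tab$review_scores_rating):

```r
# Sketch: where a raw score of 80 sits among its peers (toy data).
# The toy vector mimics the left-skewed shape of the real review scores.
scores <- c(rep(100, 40), rep(95, 30), rep(90, 15), rep(80, 10), rep(60, 5))
percentile_of_80 <- ecdf(scores)(80)          # fraction of listings scoring <= 80
z_of_80 <- (80 - mean(scores)) / sd(scores)   # same standardization used below
percentile_of_80
z_of_80
```

Here a “good enough” 80 sits in the bottom 15% of the toy scores and well below the mean, which is exactly the sort of context a raw score hides.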

In addition to computing standardized review scores for all listings, we will also consider the distribution of standardized review scores for superhost listings and non-superhost listings to see if the superhost listings are as good as Airbnb would have us believe (this is somewhat of a trivial point, but hey). We can make a box and whisker plot to visualize the differences in these distributions as well. We will also check whether it costs more to book with a superhost in Chicago.

Point 3: Superhost listings are both cheaper and better reviewed

all_reviews <- subset(analysis_tab, !is.na(review_scores_rating) &
            number_of_reviews > 0 & !is.na(availability_365) &
            availability_365 > 0, 
            select=c(review_scores_rating, is_superhost,
            number_of_reviews, id, host_id, cost))
mean_all_reviews <- mean(all_reviews$review_scores_rating)
sd_all_reviews <- sd(all_reviews$review_scores_rating)

superhost_reviews <- subset(analysis_tab, !is.na(review_scores_rating) &
            number_of_reviews > 0 & is_superhost, select=c(review_scores_rating,
            number_of_reviews, id, host_id))
mean_superhost_reviews <- mean(superhost_reviews$review_scores_rating)
sd_superhost_reviews <- sd(superhost_reviews$review_scores_rating)

ns_reviews <- subset(analysis_tab, !is.na(review_scores_rating) & number_of_reviews
           > 0 & !is_superhost, select=c(review_scores_rating, number_of_reviews,
           id, host_id))
mean_ns_reviews <- mean(ns_reviews$review_scores_rating)
sd_ns_reviews <- sd(ns_reviews$review_scores_rating)


standardized_tab <- analysis_tab %>%
  mutate(z_review = (review_scores_rating - mean_all_reviews)/sd_all_reviews) %>%
  select(z_review, id, host_id, response_rate, response_time, is_superhost, host_listings_count,
    identity_verified, property_type, cost, minimum_nights, maximum_nights,
    availability_30, availability_60, availability_90, availability_365,
    number_of_reviews, reviews_per_month, review_scores_rating, review_scores_accuracy,
    review_scores_cleanliness, review_scores_checkin, review_scores_communication,
    review_scores_value, has_license, has_about, years_as_host,
    room_type, bed_type, longitude, latitude, zipcode, neighbourhood_cleansed, host_about,
    house_rules, interaction, description, neighborhood_overview, space, summary)

ggplot(analysis_tab, aes(x=is_superhost, y=review_scores_rating)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(60, 100)) + 
  labs(x = "Superhost status", title = "Raw review score by superhost status")

ggplot(analysis_tab, aes(x=is_superhost, y=cost)) + 
  geom_boxplot() +
  #limits chosen to cover typical nightly prices; c(60, 100) would clip most listings
  scale_y_continuous(limits = c(0, 500)) + 
  labs(x = "Superhost status", title="Cost by superhost status for all listings")

We end up observing a higher median review score for superhost listings with much, much lower spread, and a lower median review score for non-superhost listings with much higher spread. This is all as expected. The boxplot also helps us visualize the differences discussed above.

The cost boxplot is pretty interesting, though. From just this visualization, we can’t tell if the non-superhost listings are just “nicer,” leading to a higher cost, or if superhosts would actually charge less for a listing of equal quality. Superhosts take a larger portion of the sale, so it’s hard to say which it is. Whatever the case, superhosts tend to be cheaper and have better reviews.

Clearly, prospective guests should try to stay with a superhost when possible.

This is a good lead into our next issue: Superhost listings may tend to be cheaper and have better ratings than non-superhost listings, but are they available more or less often than their counterparts? If we were headed into Chicago for a major event, what are our options going to be, probabilistically speaking?

Point 4: Superhost and non-superhost listings have similar yearly availability per listing on average, and there are far more available nights in non-superhost listings than superhost listings in a given year.

#disregard inactive listings
availability <- subset(analysis_tab, number_of_reviews > 0 & availability_365 > 0,
            select= availability_365)
aggregate_availability <- sum(availability$availability_365)
mean_availability <- aggregate_availability/nrow(availability)

superhost_availability <- subset(analysis_tab, number_of_reviews > 0 & 
                      availability_365 > 0 & is_superhost, select=
                      availability_365)
agg_superhost_availability <- sum(superhost_availability$availability_365)
mean_superhost_availability <- agg_superhost_availability/nrow(superhost_availability)
portion_superhost <- agg_superhost_availability/aggregate_availability   

ns_availability <- subset(analysis_tab, number_of_reviews > 0 & availability_365 
                > 0 &!is_superhost, select= availability_365)
agg_ns_availability <- sum(ns_availability$availability_365)
mean_ns_availability <- agg_ns_availability/nrow(ns_availability)
portion_ns <- agg_ns_availability/aggregate_availability

print(paste("aggregate yearly availability:", aggregate_availability))
## [1] "aggregate yearly availability: 806937"
print(paste("mean yearly availability per listing:", mean_availability))
## [1] "mean yearly availability per listing: 190.450082605617"
print(paste("percentage of aggregate availability from superhosts:", portion_superhost*100))
## [1] "percentage of aggregate availability from superhosts: 29.7921646919152"
print(paste("percentage of aggregate availability not from superhosts:", portion_ns*100))
## [1] "percentage of aggregate availability not from superhosts: 70.2078353080848"
print(paste("mean availability for a superhost listing:", mean_superhost_availability))
## [1] "mean availability for a superhost listing: 192.63141025641"
print(paste("mean availability for a not superhost listing:", mean_ns_availability))
## [1] "mean availability for a not superhost listing: 189.53931080629"

The average yearly availability for superhost listings is actually very similar to that of non-superhost listings. This makes sense, seeing as most hosts should be scheduling themselves around the same major events (conventions, sporting events, etc.). Independently of what has already been booked, we expect there to be more non-superhosts to choose from when looking for an Airbnb listing at any given time (assuming equal or greater demand for superhost listings). This is important to our motivation, since there may be times when all of the superhosts are booked and prospective guests will not have that option. It would be very helpful to prospective guests if we were able to sift through these non-superhost listings to find the especially “good” ones that are yet to be marked as superhosts. However, none of this is necessarily true unless demand for superhost listings is sufficiently large, which leads us to our next subsection.

We now use similar code to measure what portion of reviews fall under superhost listings compared to what portion of reviews fall under non-superhost listings as a proxy for measuring demand.

Point 5: Superhost listings receive more reviews than non-superhost listings

#disregard inactive listings
num_reviews <- subset(analysis_tab, number_of_reviews > 0 & availability_365 > 0,
            select = number_of_reviews)
aggregate_reviews <- sum(num_reviews$number_of_reviews)
average_reviews <- aggregate_reviews/nrow(num_reviews)

superhost_reviews <- subset(analysis_tab, number_of_reviews > 0 & availability_365
                  > 0 & is_superhost, select = number_of_reviews)
aggregate_superhost_reviews <- sum(superhost_reviews$number_of_reviews)
portion_superhost_reviews <- aggregate_superhost_reviews/aggregate_reviews
average_superhost_reviews <- aggregate_superhost_reviews/nrow(superhost_reviews)

ns_reviews <- subset(analysis_tab, number_of_reviews > 0 & availability_365 > 0 &
          !is_superhost, select = number_of_reviews)
aggregate_ns_reviews <- sum(ns_reviews$number_of_reviews)
portion_ns_reviews <- aggregate_ns_reviews/aggregate_reviews
average_ns_reviews <- aggregate_ns_reviews/nrow(ns_reviews)

print(paste("total number of reviews on record:", aggregate_reviews))
## [1] "total number of reviews on record: 129519"
print(paste("average number of reviews per listing:", average_reviews))
## [1] "average number of reviews per listing: 30.568562662261"
print(paste("percentage of reviews from superhosts:", portion_superhost_reviews*100))
## [1] "percentage of reviews from superhosts: 42.4169426879454"
print(paste("percentage of reviews from non-superhosts:", portion_ns_reviews*100))
## [1] "percentage of reviews from non-superhosts: 57.5830573120546"
print(paste("average number of reviews for a superhost listing:",average_superhost_reviews))
## [1] "average number of reviews for a superhost listing: 44.0208333333333"
print(paste("average number of reviews for a non-superhost listing:",average_ns_reviews))
## [1] "average number of reviews for a non-superhost listing: 24.9518233522917"
ggplot(analysis_tab, aes(x=is_superhost, y=years_as_host)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 5)) + 
  labs(x = "Superhost status", title="Host experience by superhost status for all listings")

We have mentioned previously that superhost listings comprise only about 30% of the available Airbnb listing-nights in Chicago in a given year, based on our dataset. However, when we look at reviews, superhost listings are responsible for 42% of all reviews. If we (boldly) assume each listing is equally likely to receive a review per stay, and that similar trends held among ex-Airbnb hosts who are not in our dataset, this would indicate higher demand for superhost listings. For online reviews in general, it is believed that people are more likely to leave a review after a very positive or a very negative experience, so these assumptions would require much more substantiation than we are capable of providing. Review counts are not the best proxy for demand for superhost listings relative to non-superhost listings, but they are the best we are going to get.

Side note: We can at the very least verify that non-superhosts and superhosts in Chicago have similar distributions of host tenure, ruling out the idea that superhosts have higher review counts simply because they have been around longer (see the above visualization).
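As a back-of-the-envelope check on this demand argument, we can put the two groups on a reviews-per-available-night basis using the aggregate figures printed earlier (the percentages are rounded here for illustration):

```r
# Back-of-the-envelope: reviews per available night, by host type.
# Aggregate figures taken from the printouts above (percentages rounded).
total_nights  <- 806937
total_reviews <- 129519
superhost_nights  <- 0.298 * total_nights   # ~29.8% of aggregate availability
ns_nights         <- 0.702 * total_nights
superhost_reviews <- 0.424 * total_reviews  # ~42.4% of all reviews
ns_reviews        <- 0.576 * total_reviews
superhost_reviews / superhost_nights  # reviews per available night, superhosts
ns_reviews / ns_nights                # reviews per available night, non-superhosts
```

Superhost listings pick up markedly more reviews per available night, which is consistent with (though far from proof of) higher demand.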

In any case, it would still be extremely helpful to know, based on our data set, how we can pick through the non-superhost listings to find the best ones, especially when our non-superhost options do not have all that many (or any) reviews yet. Ideally, we want to know how to pick a non-superhost listing that will become a superhost listing in the future. We will refer to this as the “future superhost problem” for the rest of the analysis.

Next, we will focus on which subsections of the non-superhosts of Chicago have the “best” distribution of review scores to get a feel for how we can address the future superhost problem through the use of machine learning. This notion of a “better” distribution reaches outside of my area of expertise with respect to the level of statistics required to make such a point mathematically rigorous, so instead we will simply work off of data visualizations and make broad points from there.
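One lightweight way to go a step beyond eyeballing boxplots is a two-sample Kolmogorov-Smirnov test, which compares two empirical distributions directly. A hedged sketch on simulated data (in the notebook, the two vectors would be the superhost and non-superhost z_review values):

```r
# Sketch: two-sample KS test on simulated review-score distributions.
set.seed(320)
superhost_z     <- rnorm(200, mean = 0.4, sd = 0.5)  # tighter, higher scores
non_superhost_z <- rnorm(200, mean = -0.1, sd = 1.0) # wider, lower scores
ks <- ks.test(superhost_z, non_superhost_z)
ks$statistic  # max gap between the two empirical CDFs
ks$p.value    # small value -> the distributions likely differ
```

This is only a sketch; the real review scores are heavily tied and skewed, so KS p-values on them should be treated as rough guidance rather than rigor.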

1: Trends in non-superhosts review scores by host experience

ns_reviews <- subset(analysis_tab, !is.na(review_scores_rating) & number_of_reviews
           > 0 & !is_superhost, select=c(review_scores_rating, number_of_reviews,
           id, host_id, years_as_host)) %>% 
           mutate(nearest_year = ceiling(years_as_host)) %>%
           select(nearest_year, review_scores_rating, number_of_reviews,
           id, host_id)
mean_ns_reviews <- mean(ns_reviews$review_scores_rating)
sd_ns_reviews <- sd(ns_reviews$review_scores_rating)

mean_of_ns_means = mean_ns_reviews
sd_of_ns_means = sd_ns_reviews/sqrt(nrow(ns_reviews))



standardized_tab2 <- subset(standardized_tab, !is.na(number_of_reviews) &
                  number_of_reviews > 0 & !is_superhost, select = c(z_review,
                  response_rate, number_of_reviews, identity_verified, has_about,
                  years_as_host, availability_365, availability_30,
                  availability_60, reviews_per_month, host_listings_count,
                  room_type, bed_type, longitude, latitude,
                  zipcode, neighbourhood_cleansed, cost))

standardized_tab3 <- standardized_tab2 %>% mutate(nearest_year =
                  ceiling(years_as_host)) %>% select(z_review, nearest_year)

ggplot(standardized_tab3, aes(x=factor(nearest_year), y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-7, 1)) + 
  labs(x = "Number of years as a host, rounded up", title="Distribution of Z-review scores based on years as host (non-superhosts)")

ggplot(data=standardized_tab3, aes(x=factor(nearest_year))) +
  geom_bar(stat="count") +
  labs(x = "Number of years as host")

We don’t really have enough data in each of the non-superhost experience groups to say much, so we will have to take another look at this once we have more data.

Side note: Recall that all of the data was standardized with respect to the mean and standard deviation of all of the data, and the data we are visualizing here is just the non-superhost data.
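To make the side note concrete: when a subset is standardized using the pooled mean and standard deviation, that subset’s z-scores need not center on zero. A toy sketch:

```r
# Toy sketch: z-scores for a subset, computed against the pooled mean/sd.
all_scores    <- c(90, 92, 95, 98, 99, 100)  # stand-in for all listings
subset_scores <- c(90, 92, 95)               # stand-in for the non-superhosts
z <- (subset_scores - mean(all_scores)) / sd(all_scores)
mean(z)  # negative: this subset sits below the pooled mean
```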

2: Identity verification seems to have little to no impact on the distribution of review scores

ggplot(standardized_tab2, aes(x=factor(identity_verified), y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-7.4, 2)) + 
  labs(x = "Host has verified identity", title="Distribution of Z-review scores based on identity verification status")

ggplot(data=standardized_tab2, aes(x=factor(identity_verified))) +
  geom_bar(stat="count") +
  labs(x = "Host has verified identity")

The boxplots here are so similar that I don’t think this question warrants delving into in more detail without at least expanding the scope of the analysis to more than one city, but it is a counterintuitive observation nonetheless. As of right now, I don’t think the identity verification status of a host will help us solve the future superhost problem.

3: Having an about section also has little to no impact on the distribution of review scores

ggplot(standardized_tab2, aes(x=factor(has_about), y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-7, 1.5)) + 
  labs(x = "Host has an about section", title="Distribution of Z-review scores based on about section status")

ggplot(data=standardized_tab2, aes(x=factor(has_about))) +
  geom_bar(stat="count") +
  labs(x = "Has an about section")

Again, there is very little difference in the distribution of review scores here. However, I think we might see something more useful if we performed some form of text analysis on what the about section actually says. More on this a bit further down.

4: Bed and room type data not very useful

ggplot(standardized_tab2, aes(x=factor(bed_type), y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-4, 1)) + 
  labs(x = "Bed type", title="Distribution of Z-review scores based on bed type")

ggplot(standardized_tab2, aes(x=factor(room_type), y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-8, 1)) + 
  labs(x = "Room type", title="Distribution of Z-review scores based room type")

ggplot(data=standardized_tab2, aes(x=factor(bed_type))) +
  geom_bar(stat="count") +
  labs(x = "Bed type")

ggplot(data=standardized_tab2, aes(x=factor(room_type))) +
  geom_bar(stat="count") +
  labs(x = "Room type")

The overwhelming majority of Airbnb listings (95%+) in Chicago are for “real beds,” so it’s hard to really make any generalizations based on the boxplot. We make similar observations about the room type since the rating boxplots for “private room[s]” and “entire home/apt” are very similar, and they together comprise approximately 95% of the data here. The only room type with a different boxplot again has too few representative hosts to be considered all that meaningful.
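The “95%+” style figures above can be read off a proportion table rather than a bar chart; a minimal sketch on a toy vector (in the notebook, the same one-liner would be applied to standardized_tab2$bed_type or room_type):

```r
# Sketch: proportion table for a categorical column (toy data).
bed_type <- c(rep("Real Bed", 19), "Futon")
prop.table(table(bed_type))  # Futon 0.05, Real Bed 0.95
```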

5: Distributions across (arbitrary) price ranges are very similar

standardized_tab3 <- standardized_tab2 %>% mutate(price_bracket = ifelse(cost <
                  100, "0-100", ifelse(cost < 200, "100-200", ifelse(cost < 300,
                  "200-300", ifelse(cost < 400, "300-400", "400+"))))) %>%
                  select(price_bracket, cost, z_review)

ggplot(standardized_tab3, aes(x=factor(price_bracket), y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-7, 1)) + 
  labs(x = "Price Bracket", title="Distribution of Z-review scores by price bracket")

ggplot(data=standardized_tab3, aes(x=factor(price_bracket))) +
  geom_bar(stat="count") +
  labs(x = "Price bracket")

The first three price brackets all have similar review score boxplots; they also capture the overwhelming majority of listings in Chicago. With a little more data, we could check whether paying a hotel-esque price on Airbnb truly results in higher ratings, since the more expensive listings are very few in number in Chicago. It is important to note that the choice of price ranges here is completely arbitrary.
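If we wanted brackets that are not arbitrary, one option is to let the data choose them, e.g. quartile-based brackets via cut() and quantile(). A sketch on simulated prices (the notebook would pass standardized_tab2$cost instead):

```r
# Sketch: data-driven price brackets from quartiles instead of fixed cutoffs.
set.seed(42)
cost <- round(rlnorm(500, meanlog = 4.5, sdlog = 0.6))  # toy nightly prices
brackets <- cut(cost,
                breaks = quantile(cost, probs = seq(0, 1, 0.25)),
                include.lowest = TRUE)
table(brackets)  # roughly equal counts per bracket, by construction
```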

6: Distribution of reviews slightly better for more responsive hosts

ggplot(standardized_tab, aes(x=response_time, y=z_review)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-7, 1)) + 
  labs(x = "Response time", title="Distribution of Z-review scores by host response time")
## Warning: Removed 743 rows containing non-finite values (stat_boxplot).

ggplot(standardized_tab, aes(x=response_rate, y=z_review)) + 
  geom_point() +
  #scale_y_continuous(limits = c(-7, 1)) + 
  labs(x = "Host response rate", title = "Z-review scores by host response rate")

ggplot(data=standardized_tab, aes(x=response_time)) +
  geom_bar(stat="count") +
  labs(x = "Response time")

We observe a lower spread and a higher median for hosts who respond more quickly than within a few days, but we really do not have enough hosts in the “a few days or more” category to take much value out of that. We will take another look at this once we have more data. We also generally observe that most listings whose hosts have a high response rate (to inquiries) also have a good review score. This should be helpful to us in making a model.

7: Text analysis; host_about and word choice

library(tidytext)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)

text_tab <- standardized_tab %>%
  subset(has_about, select = c(host_id, is_superhost, host_about)) %>%
  unnest_tokens(word, host_about) %>%
  anti_join(stop_words)
## Joining, by = "word"
text_tab %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

text_tab_s <- text_tab %>% subset(is_superhost, select = c(word))
text_tab_s %>%
  count(word, sort = TRUE) %>%
  filter(n > 260) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  labs(title = "Word choice in the about section of superhosts", x = NULL) +
  coord_flip()

text_tab_ns <- text_tab %>% subset(!is_superhost, select = c(word))
text_tab_ns %>%
  count(word, sort = TRUE) %>%
  filter(n > 450) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  labs(title = "Word choice in the about section of non-superhosts", x = NULL) +
  coord_flip()

There are several ways to show that word choice is similar in the about sections of non-superhosts and superhosts, but this one is the simplest. Sentiment analysis could be a bit more relevant here. However, similar word choice will tend to produce similar sentiment as well. But we won’t know for sure until we try!

8: Text analysis; description, summary, and sentiment analysis

library(tidytext)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)

text_tab <- standardized_tab %>%
  subset(!is.na(summary) & !is.na(description),
         select = c(host_id, is_superhost, summary, description)) %>%
  mutate(summ_and_desc = paste(summary, description, sep = " ")) %>%
  unnest_tokens(word, summ_and_desc) %>%
  anti_join(stop_words) %>%
  select(is_superhost, host_id, word)
## Joining, by = "word"
text_tab %>%
  count(word, sort = TRUE) %>%
  filter(n > 3900) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

words_by_host_type <- text_tab %>%
  count(is_superhost, word, sort = TRUE) %>%
  ungroup()

listing_sentiments <- words_by_host_type %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(score2 = score * n / sum(n))

ggplot(listing_sentiments, aes(x=is_superhost, y=score2)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-.00125, .00125)) + 
  labs(x = "Superhost status", title="Distribution of words used weighted for frequency of use")

listing_sentiments %>%
  filter((score2 <= -1.901954e-03 & !is_superhost) | (score2 <= -5.705863e-04 & is_superhost)) %>%
  mutate(word = reorder(word, score2)) %>%
  ggplot(aes(word, score2, fill = is_superhost)) +
  geom_col() +
  facet_wrap(~is_superhost, scales = "free_y") +
  xlab(NULL) +
  coord_flip()

listing_sentiments <- words_by_host_type %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(is_superhost) %>%
  summarize(score = sum(score * n) / sum(n))

text_tab %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(host_id, is_superhost) %>%
  summarize(score = sum(score)) %>%
  ggplot(aes(x=is_superhost, y=score)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-10, 30)) + 
  labs(x = "Superhost status", title="Distribution of per-listing sentiment of summary and description sections by superhost status")

We have a whole lot of visualizations to go through here.

The first bar graph tells us that the word choice hasn’t changed all that much from what we saw in the host_about section. The boxplot can be a little hard to understand: a negative outlier is either a very negative word used somewhat often, or a slightly negative word used quite often, and the positive outliers are analogous for positive words. We do not care very much about the non-outliers. We can see that the non-superhosts have many more negative outliers, suggesting that they use certain negative words more than superhosts do.

We then checked what these negative words were in the second bar graph, and found that there was no difference in which negative words were being chosen, just in their frequency. It also showed us words that likely weren’t even used with a negative connotation (“fire” most likely used to tell the guest about a fire extinguisher, etc.), demonstrating the weakness of this kind of analysis for this dataset. The final boxplot tells us that superhosts and non-superhosts alike tend to write very positive description and summary sections, with superhosts being the slightly more positive of the two.

This all makes sense, but it is a bit of a letdown since it means that these fields also will not be giving us much information as to what about a superhost listing distinguishes it from a non-superhost listing.

In summary, we can draw several observations, some of them unintuitive, from this section:

* Review distributions stay roughly similar, improving only gradually as we move up through the price brackets, until we pass $300 and approach entry-level city hotel pricing, at which point the positive shift in the review distribution becomes more noticeable.

* Almost all listings in Chicago are for real beds, so it is not very useful to incorporate bed type into our analysis.

* Almost all listings in Chicago are for entire apartments or private rooms, and their distributions appear very similar in boxplot visualizations, so it is hard for us to incorporate room type into our analysis.

* Having no about section or no verified host identity does not appear to negatively impact review scores, based on our visualizations.

* Number of years as a non-superhost might matter a bit, but we need more data.

* Host response time and host response rate seem somewhat helpful for making a model.

* Sentiment analysis of Airbnb listings does little to distinguish superhost listings from non-superhost listings.

Part 3: A more robust analysis

At the end of part 2, we decided on several fields to use in our model to solve the future superhost problem, but we still needed more data to check if several others would be useful. We will import data from other American cities including Austin, Boston, Denver, Los Angeles, Nashville, New Orleans, New York City, Oakland, Portland, San Diego, San Francisco, Seattle, Minneapolis, and Washington DC in hopes of finding other useful fields for our model.

Let’s hop into the data once more, this time with a much more directed approach. Because we are going to be processing so many CSVs, there has to be a wall of code below. We will try to keep things concise using functions, but even then it certainly won’t be pretty.

We will start off by defining our CSV processing functions.

read_and_clean_listings <- function(csv_name, csv_date, city_name) {
  l <- read_csv(csv_name) %>%
    mutate(response_rate = as.double(gsub("%", "", host_response_rate))) %>%
    mutate(response_time = factor(host_response_time, levels = c("N/A", "a few days or more", "within a day", "within a few hours", "within an hour"))) %>%
    mutate(city = city_name) %>%
    mutate(cost = as.double(gsub(",", "", substr(price, 2,9)))) %>%
    mutate(is_superhost = ifelse(host_is_superhost == "t", TRUE, FALSE)) %>%
    mutate(identity_verified = ifelse(host_identity_verified == "t", TRUE, FALSE)) %>%
    mutate(has_license = ifelse(!is.na(license) & (validate_license_by_city(license, city_name)), TRUE, FALSE)) %>%
    mutate(has_about = ifelse(is.na(host_about), FALSE, TRUE)) %>%
    mutate(years_as_host = as.numeric(difftime(as.Date(csv_date),
      as.Date(host_since), units = "weeks")) / 52.25) %>%
    select(id, host_id, response_rate, is_superhost, host_listings_count,
      identity_verified, property_type, cost, minimum_nights, maximum_nights,
      availability_30, availability_60, availability_90, availability_365,
      number_of_reviews, review_scores_rating, review_scores_accuracy,
      review_scores_cleanliness, review_scores_checkin, review_scores_communication,
      review_scores_value, reviews_per_month, has_license, has_about, years_as_host,
      room_type, bed_type, longitude, latitude, zipcode, neighbourhood_cleansed,
      city, response_time)
  l
}

#License number formats vary by city; mapping each city to its expected digit
#pattern keeps this easy to modify if I ever need to.
validate_license_by_city <- function(license, city_name) {
  patterns <- c("Austin"        = "[0-9]{6}",
                "Boston"        = "[0-9]{6}",
                "Los Angeles"   = "[0-9]{5}",
                "Nashville"     = "[0-9]{8}",
                "Oakland"       = "[0-9]{6}",
                "New Orleans"   = "[0-9]{5}",
                "New York"      = "[0-9]{3}",
                "Portland"      = "[0-9]{3}",
                "San Diego"     = "[0-9]{6}",
                "San Francisco" = "[0-9]{7}",
                "Seattle"       = "[0-9]{3}",
                "Minneapolis"   = "[0-9]{3}",
                "DC"            = "[0-9]{3}")
  if (!city_name %in% names(patterns)) {
    return(FALSE)
  }
  str_detect(license, patterns[[city_name]])
}
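One caveat worth keeping in mind: str_detect() matches the pattern anywhere in the string, so these per-city checks are deliberately loose. A quick illustration with made-up license strings:

```r
library(stringr)

# str_detect() returns TRUE when the pattern occurs anywhere in the string,
# so a license with a prefix still passes a six-digit check.
str_detect("ABC-123456", "[0-9]{6}")  # TRUE: contains a six-digit run
str_detect("12345", "[0-9]{6}")       # FALSE: only five digits
```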

Now we will actually give these functions some data.

library(tidyverse)
library(stringr)

austin_listings_tab <- read_and_clean_listings("listings-2.csv", "2018-05-14", "Austin")
boston_listings_tab <- read_and_clean_listings("listings-3.csv", "2018-05-17", "Boston")
la_listings_tab <- read_and_clean_listings("listings-4.csv", "2018-05-09", "Los Angeles") 
nashville_listings_tab <- read_and_clean_listings("listings-5.csv", "2018-05-17", "Nashville") 
oakland_listings_tab <- read_and_clean_listings("listings-7.csv", "2018-05-17", "Oakland")
new_orleans_listings_tab <- read_and_clean_listings("listings-17.csv", "2018-05-09", "New Orleans")
ny_listings_tab <- read_and_clean_listings("listings-8.csv", "2018-06-03", "New York")
portland_listings_tab <- read_and_clean_listings("listings-9.csv", "2018-05-13", "Portland") 
san_diego_listings_tab <- read_and_clean_listings("listings-10.csv", "2018-05-17", "San Diego")
san_francisco_listings_tab <- read_and_clean_listings("listings-11.csv", "2018-05-09", "San Francisco")
seattle_listings_tab <- read_and_clean_listings("listings-12.csv", "2018-05-17", "Seattle") 
minneapolis_listings_tab <- read_and_clean_listings("listings-13.csv", "2018-05-09", "Minneapolis") 
dc_listings_tab <- read_and_clean_listings("listings-14.csv", "2018-05-18", "DC") 

The above code is not very interesting to talk about as it’s mostly a rehashing of what we did for Chicago, but now we get to combine them all using a single rbind() call.


all_listings <- rbind(austin_listings_tab, boston_listings_tab, la_listings_tab, nashville_listings_tab, new_orleans_listings_tab, ny_listings_tab, oakland_listings_tab, portland_listings_tab, san_diego_listings_tab, san_francisco_listings_tab, seattle_listings_tab, minneapolis_listings_tab, dc_listings_tab) 

For all of the US cities for which we have data, are their review distributions inflated? Let’s see.

ggplot(data=all_listings, aes(x=review_scores_rating, color=city)) +
  geom_density() + 
  scale_x_continuous(limits = c(70, 100)) + 
  labs(x = "Review scores")

Certainly.

Now, let’s standardize everything by city.

mean_and_sd <- all_listings %>%
  filter(!is.na(review_scores_rating), number_of_reviews > 0) %>%
  group_by(city) %>%
  summarize(avg = mean(review_scores_rating), standard_dev = sd(review_scores_rating))

#Standardize each listing's review score within its city by joining in the
#per-city mean and standard deviation computed above. (A join avoids a long
#chain of ifelse() calls and the city-name mismatches it invites.)
all_listings_standardized <- all_listings %>%
  filter(!is.na(review_scores_rating), number_of_reviews > 0, !is_superhost) %>%
  select(review_scores_rating, city, id, years_as_host, response_rate,
         response_time, number_of_reviews) %>%
  inner_join(mean_and_sd, by = "city") %>%
  mutate(z_review_score = (review_scores_rating - avg) / standard_dev) %>%
  mutate(nearest_year = ifelse(is.na(years_as_host), 0, ceiling(years_as_host)))

Let’s revisit some of our visualizations from Part 2 and see if the trend is any more apparent this time around.

all_listings_standardized_final <- left_join(all_listings_standardized, all_listings, by=c("id", "city"))

ggplot(all_listings_standardized_final, aes(x= factor(nearest_year), y=as.numeric(z_review_score))) + 
  geom_boxplot() + 
  scale_y_continuous(limits = c(-2, 1)) + 
  labs(x = "Number of years as host, rounded up", y = "Z review score")

ggplot(data=all_listings_standardized_final, aes(x=factor(nearest_year))) +
  geom_bar(stat="count") +
  labs(x = "Number of years as host, rounded up")

This is a much starker and clearer trendline. Something about the subset of Airbnb listings that have existed on the service for one year or less is causing them to have a higher median review score than any other experience cohort. This is very useful to us, because we can expect more Airbnb hosts in their first year to meet the review score requirement for superhost promotion than hosts at other levels of experience. If we were to double our work, we could import and clean last year’s data to see whether the listings that made the switch from non-superhost to superhost over that year tended to be listings still in their first year with the service. This would be a significant step forward on the future superhost problem, as it would suggest that prospective guests might want to prioritize hosts with great early track records over hosts with longer track records if they are aiming for a future superhost in the event that all of the superhosts are unavailable.

This also helps us define our machine learning problem more specifically: We want to predict which non-superhost will become a superhost in the next data collection cycle. But this will have to come in the next, next part!

1: Airbnb hosts are unregulated in general

ggplot(data=all_listings, aes(x=has_license)) +
  geom_bar() 

Our initial observation about the lack of regulation is confirmed here.

2: Identity verification status does not impact the distribution of review scores in general

ggplot(all_listings_standardized_final, aes(x=factor(identity_verified), y=as.numeric(z_review_score))) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-2, 1)) + 
  labs(x = "Host has verified identity", title="Distribution of Z-review scores based on identity verification status")

ggplot(data=all_listings_standardized_final, aes(x=factor(identity_verified))) +
  geom_bar(stat="count") +
  labs(x = "Host has verified identity")

We end up with observations analogous to those we made in Part 2. Identity verification status does not appear to be an important factor in the distribution of review scores.

3: Other cities are also lacking in listings which feature other bed/room types

ggplot(all_listings_standardized_final, aes(x=factor(bed_type), y=as.numeric(z_review_score))) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-2, 2)) + 
  labs(x = "Bed type", title="Distribution of Z-review scores based on bed type")

ggplot(all_listings_standardized_final, aes(x=factor(room_type), y=as.numeric(z_review_score))) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(-2, 2)) + 
  labs(x = "Room type", title="Distribution of Z-review scores based on room type")

ggplot(data=all_listings_standardized_final, aes(x=factor(bed_type))) +
  geom_bar(stat="count") +
  labs(x = "Bed type")

ggplot(data=all_listings_standardized_final, aes(x=factor(room_type))) +
  geom_bar(stat="count") +
  labs(x = "Room type")

Similar conclusions to those of Part 2; we are still severely lacking in data for shared rooms and all bed types other than “real bed,” so there isn’t much we can fairly compare.

4: Having an about section does not impact the distribution of review scores in general

ggplot(all_listings_standardized_final, aes(x=factor(has_about), y=z_review_score)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(-4, 1)) + 
  labs(x = "Host has an about section", title="Distribution of Z-review scores based on about section status")

ggplot(data=all_listings_standardized_final, aes(x=factor(has_about))) +
  geom_bar(stat="count") +
  labs(x = "Has an about section")

We end up with similar results to the previous box plot discussion. Our result here is merely a scaled up version of what we saw in Part 2. The distributions just aren’t all that different.

I decided against reexamining price brackets in the same way because costs of living vary quite a bit from city to city, and my earlier price ranges were arbitrary. Similarly, performing text mining on all of the cities at once would likely not be very useful because we did not see anything meaningful in Chicago.

Part 4: A focused analysis; Time spent as host and becoming a superhost

As we discussed in the previous section, we have a hunch that Airbnb selects its superhosts while they are still in their first year with Airbnb as non-superhosts. If this can be shown using their data, we will have at least partially solved the future superhost problem and, importantly, highlighted the excellence of Airbnb non-superhosts with good early track records for prospective guests who do not have the luxury of staying with a superhost.

Even though this part requires fewer columns from the raw data than last time, we will clean everything all the same (in case we need it later). We will also make use of the very same CSV processing functions we defined in Part 3, so refer back to them if you’ve forgotten what the code is doing.

Anyway, this involves us importing and cleaning Airbnb’s data from their previous data collection cycle. So let’s get right to it!

austin_listings_tab2 <- read_and_clean_listings("listings-2.1.csv", "2017-03-07", "Austin") 
boston_listings_tab2 <- read_and_clean_listings("listings-3.1.csv", "2016-09-07", "Boston") 
la_listings_tab2 <- read_and_clean_listings("listings-4.1.csv", "2017-04-02", "Los Angeles") 
nashville_listings_tab2 <- read_and_clean_listings("listings-5.1.csv", "2016-09-06", "Nashville")
#denver_listings_tab2 <- read_and_clean_denver("listings-16.1.csv", "2016-05-16", "Denver") 
oakland_listings_tab2 <- read_and_clean_listings("listings-7.1.csv", "2016-05-04", "Oakland") 
new_orleans_listings_tab2 <- read_and_clean_listings("listings-17.1.csv", "2017-05-02", "New Orleans")
ny_listings_tab2 <- read_and_clean_listings("listings-8.1.csv", "2017-06-02", "New York")
portland_listings_tab2 <- read_and_clean_listings("listings-9.1.csv", "2017-05-07", "Portland")
san_diego_listings_tab2 <- read_and_clean_listings("listings-10.1.csv", "2016-07-07", "San Diego")
san_francisco_listings_tab2 <- read_and_clean_listings("listings-11.1.csv", "2017-05-02", "San Francisco")
seattle_listings_tab2 <- read_and_clean_listings("listings-12.1.csv", "2016-01-04", "Seattle")
minneapolis_listings_tab2 <- read_and_clean_listings("listings-13.1.csv", "2018-04-07", "Minneapolis")
dc_listings_tab2 <- read_and_clean_listings("listings-14.1.csv", "2018-05-18", "DC")
all_listings2 <- rbind(austin_listings_tab2, boston_listings_tab2, la_listings_tab2, nashville_listings_tab2, new_orleans_listings_tab2, ny_listings_tab2, oakland_listings_tab2, portland_listings_tab2, san_diego_listings_tab2, san_francisco_listings_tab2, seattle_listings_tab2, minneapolis_listings_tab2, dc_listings_tab2)


So, the superhosts in the most recent data (5/2018 or 6/2018) fit into one of three categories. They are…

* Type 1: new superhosts in the second data collection cycle who joined Airbnb before the first data collection cycle (collected between 2016 and 2017)

* Type 2: new superhosts in the second data collection cycle who joined Airbnb after the first data collection cycle

* Type 3: old superhosts who were superhosts before the prior data collection cycle as well

Our interest is in superhosts fitting the first two types. We want to examine approximately how long those particular superhosts were non-superhosts before being recognized and promoted by Airbnb.

#this computes type 1 superhost listings
new_superhosts <- subset(merge(all_listings, all_listings2, all=TRUE, by="id"), is_superhost.x==TRUE & is_superhost.y==FALSE, select=c(id, host_id.x, city.x, years_as_host.y, number_of_reviews.y, host_listings_count.y, is_superhost.y, years_as_host.x, number_of_reviews.x, host_listings_count.x, is_superhost.x))

#This computes type 2 superhost listings: superhosts in the later snapshot that
#joined too recently to appear in the earlier snapshot. The weeks between
#snapshots vary by city (see the weaknesses discussion below).
cycle_weeks <- tibble(
  city.x = c("Austin", "Boston", "DC", "Los Angeles", "Minneapolis", "Nashville",
             "New Orleans", "New York", "Oakland", "Portland", "San Diego",
             "San Francisco", "Seattle"),
  weeks  = c(62, 89, 63, 58, 5, 89, 54, 53, 107, 54, 97, 54, 124)
)

new_superhosts2 <- merge(all_listings, all_listings2, all = TRUE, by = "id") %>%
  inner_join(cycle_weeks, by = "city.x") %>%
  filter(is_superhost.x == TRUE, is.na(is_superhost.y),
         years_as_host.x < weeks / 52.25) %>%
  select(id, host_id.x, city.x, years_as_host.y, number_of_reviews.y,
         host_listings_count.y, is_superhost.y, years_as_host.x,
         number_of_reviews.x, host_listings_count.x, is_superhost.x)

all_new_superhosts <- rbind(new_superhosts, new_superhosts2)

hosts_by_city <- all_new_superhosts %>% #in case hosts have multiple listings across multiple cities
  group_by(host_id.x) %>%
  summarise(n_listings = n_distinct(id))

With the new superhosts identified, let’s visualize their tenure.

distinct_new_superhosts <- all_new_superhosts %>% distinct(host_id.x, .keep_all = TRUE) %>%
  mutate(years_as_host.y = ifelse(is.na(years_as_host.y), years_as_host.x, years_as_host.y))

ggplot(data=distinct_new_superhosts, aes(x=years_as_host.x, color=city.x)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at latest)")

ggplot(data=distinct_new_superhosts, aes(x=years_as_host.x)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at latest)")

ggplot(data=distinct_new_superhosts, aes(x=factor(ceiling(years_as_host.x)))) +
  geom_bar(stat="count") +
  labs(x = "Number of years as host, rounded up", title="Years as host before being promoted (at latest)")

ggplot(data=distinct_new_superhosts, aes(x=years_as_host.y, color=city.x)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at earliest)")

ggplot(data=distinct_new_superhosts, aes(x=years_as_host.y)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at earliest)")

ggplot(data=distinct_new_superhosts, aes(x=factor(ceiling(years_as_host.y)))) +
  geom_bar(stat="count") +
  labs(x = "Number of years as host, rounded up", title="Years as host before being promoted (at earliest)")

Before we can really start talking about these visualizations, we first have to understand how the superhost promotion process works. At the end of every quarter, Airbnb assesses each host’s performance over the past year (or less, if the host has spent less than a year with Airbnb). Hosts who meet the requirements for being an outstanding host receive a promotion to superhost over the next 10 days. However, becoming a superhost entails much more than meeting a rating requirement: there is also a host response rate requirement, a minimum number of visits requirement, and several others. These other requirements could easily take some hosts more than a year to meet.

This helps explain why these visualizations do not exactly mimic the ones we made for the distribution of Z rating scores by host experience in Part 3: the Part 3 visualizations had a sudden drop-off in Z rating score between first-year hosts and the rest, while these feature a much more gradual drop-off in superhost promotions. That said, these visuals are still fairly telling in that they seem to confirm the idea that Airbnb tends to promote hosts to superhost sooner rather than later, though we are still struggling to make this more concrete. In the next part, we will see if this relatively simple intuition is enough to build a predictor for superhost promotion.
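To make the quarterly timing concrete, here is a small helper (hypothetical: it assumes assessments fall exactly at calendar quarter ends, which is our reading of the process rather than something the data confirms) that finds the most recent assessment date on or before a snapshot date:

```r
# Most recent quarter-end assessment date on or before a snapshot date.
# Assumes assessments happen exactly at calendar quarter ends.
last_assessment <- function(snapshot) {
  yr <- as.numeric(format(snapshot, "%Y"))
  q_ends <- as.Date(c(paste0(yr - 1, "-12-31"),
                      paste0(yr, c("-03-31", "-06-30", "-09-30", "-12-31"))))
  max(q_ends[q_ends <= snapshot])
}

last_assessment(as.Date("2018-05-14"))  # 2018-03-31
```

A host who appears as a new superhost in a May snapshot was therefore most likely promoted in the 10 days following the March quarter end.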

Weaknesses with our data:

* The earlier listing observations (~2016-2017) were taken at a wide variety of times, whereas the later listing observations (~5/2018) were all taken within the same two-month span. This means we were considering timespans of the following different lengths:

- Austin: ~62 weeks between 2017 and 2018 observations

- Boston: ~88 weeks between 2016 and 2018 observations

- LA: ~57 weeks between 2017 and 2018 observations

- Nashville: ~88 weeks between 2016 and 2018 observations

- Oakland: ~106 weeks between 2016 and 2018 observations

- New Orleans: ~53 weeks between 2017 and 2018 observations

- New York: ~52 weeks between 2017 and 2018 observations

- Portland: ~53 weeks between 2017 and 2018 observations

- San Diego: ~97 weeks

- San Francisco: ~53 weeks

- Seattle: ~123 weeks

- Minneapolis: ~4 weeks

- DC: ~62 weeks

This means that our data is somewhat skewed by hosts whose transition to superhost we could not observe quickly enough. This is particularly problematic because it means we are either overestimating or underestimating the point at which a host was promoted, by up to two full years.

For example, suppose x non-superhosts become superhosts in January, y in February, and z in March for city ABC; that is, promotions occur monthly. Further suppose that our two datasets for ABC capture the listing space in late December and late March. We will observe approximately x+y+z new superhosts, but the x January superhosts will carry about two extra months on their years_as_host, and the y February superhosts about one extra month. This shifts our observed time spent as a non-superhost up by one to two months for x+y of them, as we had no way of knowing exactly when the promotion occurred. On the other hand, we can instead plot the host’s experience level at the beginning of the time period (before they were promoted at all), which similarly underestimates the time it took for a host to be promoted.

If we were able to use shorter time intervals, then we could know almost exactly when hosts were promoted. For this example, that means one-month intervals; for our Airbnb dataset, it means three-month intervals. Most importantly, much shorter intervals would let us answer our machine learning problem in a much more meaningful way: thinking from the perspective of a prospective guest, it is significantly better to predict which non-superhost will be promoted a month from now than a year from now.
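The worked example above can be sketched numerically (all counts hypothetical):

```r
# Toy version of the monthly-promotion example: x = 10 hosts promoted in
# January, y = 7 in February, z = 4 in March, with snapshots taken in late
# December and late March.
promo_month <- c(rep(1, 10), rep(2, 7), rep(3, 4))

# Extra tenure each host appears to have at the March snapshot, in months.
overestimate_months <- 3 - promo_month

table(overestimate_months)  # 4 hosts off by 0 months, 7 by 1, 10 by 2
mean(overestimate_months)   # average overestimate: 27/21, about 1.29 months
```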

Now, what else can we show?

1: The overwhelming majority of new superhosts only have 1 listing.

ggplot(data=hosts_by_city, aes(x=factor(hosts_by_city[[2]]))) +
  geom_bar() +
  scale_x_discrete(limits = c("1","2","3","4","5")) +
  xlab("number of listings") +
  ylab("number of hosts")

Part 4 (continued): An even more focused analysis

Last time we observed something very close to what we were expecting with respect to the experience level of a superhost (in years) at the time of promotion. We will need to use shorter time intervals to give ourselves the opportunity to observe something that confirms our intuition a little better.

Into the data…

austin_listings_tab3 <- read_and_clean_listings("listings-2.3.csv", "2018-08-14", "Austin") 
boston_listings_tab3 <- read_and_clean_listings("listings-3.3.csv", "2018-08-17", "Boston")
la_listings_tab3 <- read_and_clean_listings("listings-4.3.csv", "2018-08-07", "Los Angeles") 
nashville_listings_tab3 <- read_and_clean_listings("listings-5.3.csv", "2018-08-16", "Nashville")
#Denver omitted; unable to find a suitable time interval
oakland_listings_tab3 <- read_and_clean_listings("listings-7.3.csv", "2018-08-16", "Oakland")
new_orleans_listings_tab3 <- read_and_clean_listings("listings-17.3.csv", "2018-08-07", "New Orleans")
ny_listings_tab3 <- read_and_clean_listings("listings-8.3.csv", "2018-08-06", "New York")
portland_listings_tab3 <- read_and_clean_listings("listings-9.3.csv", "2018-08-14", "Portland")
san_diego_listings_tab3 <- read_and_clean_listings("listings-10.3.csv", "2018-08-16", "San Diego") 
san_francisco_listings_tab3 <- read_and_clean_listings("listings-11.3.csv", "2018-08-06", "San Francisco")
seattle_listings_tab3 <- read_and_clean_listings("listings-12.3.csv", "2018-08-16", "Seattle")
minneapolis_listings_tab3 <- read_and_clean_listings("listings-13.3.csv", "2018-08-06", "Minneapolis")
dc_listings_tab3 <- read_and_clean_listings("listings-14.3.csv", "2018-08-08", "DC")
all_listings3 <- rbind(austin_listings_tab3, boston_listings_tab3, la_listings_tab3, nashville_listings_tab3, new_orleans_listings_tab3, ny_listings_tab3, oakland_listings_tab3, portland_listings_tab3, san_diego_listings_tab3, san_francisco_listings_tab3, seattle_listings_tab3, minneapolis_listings_tab3, dc_listings_tab3)

#type 1
new_superhosts3 <- subset(merge(all_listings, all_listings3, all=TRUE, by="id"), is_superhost.x==FALSE & is_superhost.y==TRUE, select=c(id, host_id.x, city.x, years_as_host.y, number_of_reviews.y, host_listings_count.y, is_superhost.y, years_as_host.x, number_of_reviews.x, host_listings_count.x, is_superhost.x))

#This computes type 2 superhost listings for the shorter window: the thresholds
#are the weeks between the May/June and August 2018 snapshots, per city.
cycle_weeks3 <- tibble(
  city.x = c("Austin", "Boston", "DC", "Los Angeles", "Minneapolis", "Nashville",
             "New Orleans", "New York", "Oakland", "Portland", "San Diego",
             "San Francisco", "Seattle"),
  weeks  = c(6, 6, 6, 6, 6, 6, 6, 10, 6, 6, 6, 6, 6)
)

new_superhosts4 <- merge(all_listings, all_listings3, all = TRUE, by = "id") %>%
  inner_join(cycle_weeks3, by = "city.x") %>%
  filter(is_superhost.x == TRUE, is.na(is_superhost.y),
         years_as_host.x < weeks / 52.25) %>%
  select(id, host_id.x, city.x, years_as_host.y, number_of_reviews.y,
         host_listings_count.y, is_superhost.y, years_as_host.x,
         number_of_reviews.x, host_listings_count.x, is_superhost.x)

all_new_superhosts2 <- rbind(new_superhosts3, new_superhosts4)

With everything cleaned up, we can look at the same things we did in the previous part to see if the trend is any more clear this time around. Most time intervals spanned one month, and none spanned more than 3 months.

distinct_new_superhosts2 <- all_new_superhosts2 %>% distinct(host_id.x, .keep_all = TRUE) %>%
  mutate(years_as_host.y = ifelse(is.na(years_as_host.y), years_as_host.x, years_as_host.y))

ggplot(data=distinct_new_superhosts2, aes(x=years_as_host.x, color=city.x)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at latest)")

ggplot(data=distinct_new_superhosts2, aes(x=years_as_host.x)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at latest)")

ggplot(data=distinct_new_superhosts2, aes(x=factor(ceiling(years_as_host.x)))) +
  geom_bar(stat="count") +
  labs(x = "Number of years as host, rounded up", title="Years as host before being promoted (at latest)")

ggplot(data=distinct_new_superhosts2, aes(x=years_as_host.y, color=city.x)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at earliest)")

ggplot(data=distinct_new_superhosts2, aes(x=years_as_host.y)) +
  geom_density() +
  scale_x_continuous(breaks = c(2,4,6,8,10)) + 
  labs(title="Years as host before being promoted (at earliest)")

ggplot(data=distinct_new_superhosts2, aes(x=factor(ceiling(years_as_host.y)))) +
  geom_bar(stat="count") +
  labs(x = "Number of years as host, rounded up", title="Years as host before being promoted (at earliest)")

We end up observing a slightly less clear trend this time around. We would need to examine more stretches of time to completely confirm or rule out the idea that new hosts with strong early track records are the next best thing to superhosts for prospective Airbnb guests (in terms of quality), but I hope that this exploratory data analysis has left you with some things to think about nonetheless. Also note that the “at latest” and “at earliest” visualizations look nearly identical because of how short our time interval was.

However, with the shorter time interval, we can actually do something interesting with our exploratory analysis, and create a model to solve the future superhost problem.

Part 5: Using machine learning to predict if a host will become a superhost in the next data cycle

When I first started writing this data analysis piece, InsideAirbnb’s data was not published frequently enough to try applying any machine learning to the future superhost problem. In the months since, they have started to upload their data on a monthly basis for many American cities, so I can finally do what I set out to do. A big thank you to InsideAirbnb for being so open with their data, and for using their resources to post it frequently for the public.

Step 1: Defining the target function

The goal here is going to be to learn the function f which takes a host and outputs “True” or “False” to indicate if the host is predicted to be promoted to a superhost in the next data collection cycle.

Step 2: Which model should we use?

I think a natural choice here is a decision tree. We have a lot of features, and we also expect a fair amount of collinearity due to the nature of the superhost promotion requirements. We also most likely have rather noisy data, even after all of the exploration we did. Of course, we will use random forests and boosting to make the model as accurate as possible.

Step 3: Create a table containing only columns which will be useful for prediction

The following attributes must meet a threshold in order for Airbnb to promote a host: review score, response rate

I believe the following attributes encode additional information about who gets promoted: years as host, reviews per month, number of reviews, host response time

library(randomForest)
library(tree)
library(dplyr)

#all_listings is May 2018
#all_listings3 is August 2018
#both have review_scores_rating, response_rate, number_of_reviews, reviews_per_month, years_as_host, response_time
#they also have cost

training_listings <- all_listings_standardized_final %>% filter(!is_superhost & !is.na(z_review_score))
training_listings <- merge(training_listings, new_superhosts3, all = TRUE, by="id") %>%
  mutate(promoted = ifelse(is.na(is_superhost.y), FALSE, TRUE)) %>% 
  subset(!is.na(response_time.x), select=c(id, promoted, z_review_score, response_rate.x, reviews_per_month, years_as_host.x.x, response_time.x))

training_listings$response_rate.x[is.na(training_listings$response_rate.x)] <- 0
training_listings$z_review_score[is.na(training_listings$z_review_score)] <- 0
training_listings$reviews_per_month[is.na(training_listings$reviews_per_month)] <- 0
training_listings$years_as_host.x.x[is.na(training_listings$years_as_host.x.x)] <- 0

set.seed(121)
test_subset <- training_listings %>% group_by(promoted) %>% sample_frac(.2) %>% ungroup()
training_listings <- training_listings %>%
  anti_join(test_subset, by='id')
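
Before jumping to ensembles, it is worth sanity-checking a single classification tree as a baseline. This is a minimal sketch using the `tree` package loaded above, and it assumes `training_listings` and `test_subset` exist exactly as constructed here:

```r
# Baseline: one classification tree (no bagging or boosting),
# fit on the same predictors the ensembles below will use
baseline_fit <- tree(as.factor(promoted) ~ . - id, data = training_listings)

# Confusion matrix on the held-out 20 percent
baseline_pred <- predict(baseline_fit, test_subset, type = "class")
table(baseline_pred, test_subset$promoted)
```

If the ensembles that follow do not beat this single tree by a meaningful margin, the extra computation would be hard to justify.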

Let’s see how we do with random forests.

correct_falses   = integer(5)
correct_trues    = integer(5)
incorrect_falses = integer(5)
incorrect_trues  = integer(5)
class_error_f    = double(5)
class_error_t    = double(5)
for(mtry in 1:5){
  fit = randomForest(as.factor(promoted)~., data = training_listings %>% select(-id), mtry=mtry, ntree = 350)
  pred = predict(fit, test_subset)
  a = table(pred, test_subset$promoted)
  #print(a)
  correct_falses[mtry] = a[[1]]
  correct_trues[mtry]  = a[[4]]
  incorrect_falses[mtry]  = a[[3]]
  incorrect_trues[mtry]   = a[[2]]
  class_error_f[mtry] = incorrect_falses[mtry]/(incorrect_falses[mtry]+correct_falses[mtry])
  class_error_t[mtry] = incorrect_trues[mtry]/(incorrect_trues[mtry]+correct_trues[mtry])
  print("finished an iteration")
}
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
#tabulate our error rates
formatted_class_errors <- tibble(value = class_error_f, value1 = class_error_t, value2 = 1:5)

#wrapped in an if statement in case a new seed or test set is used
if(is.nan(formatted_class_errors$value1[1])) {
  formatted_class_errors$value1[1] <- 1.00  
}

#Some error visualizations
ggplot(data=formatted_class_errors, aes(x=value2, y= value)) +
  geom_line() +
  scale_y_continuous(limits = c(0, .5)) +
  labs(title="Misclassification rate of negative results", x ="mtry value", y = "class error (false)")

ggplot(data=formatted_class_errors, aes(x=value2, y= value1)) +
  geom_line() +
  scale_y_continuous(limits = c(.2, 1))+
  labs(title="Misclassification rate of positive results", x ="mtry value", y = "class error (true)")

The first graph shows the proportion of our non-promotion (false) predictions that were incorrect. The second graph shows the proportion of our promotion (true) predictions that were incorrect. Note that mtry=1 falls below the plotted range in the second graph because that model predicted true exactly once, and that prediction was correct (so no true prediction was incorrect for mtry=1). Not too bad. mtry=2 seems to be our best bet, giving us an error rate of 35 percent with respect to new superhosts. Let’s try boosting now.
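
Random forests can also give us a quick read on which features drive the predictions, analogous to the importance summary we will get from gbm below. A sketch, refitting the best forest (mtry=2 from the loop above) with importance tracking enabled:

```r
# Refit the mtry = 2 forest with permutation importance enabled
best_fit <- randomForest(as.factor(promoted) ~ .,
                         data = training_listings %>% select(-id),
                         mtry = 2, ntree = 350, importance = TRUE)

# Mean decrease in accuracy and in Gini impurity per predictor
importance(best_fit)
varImpPlot(best_fit)
```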

library(gbm)
#recode the logical promoted column as numeric 0/1 for gbm
training_listings$promoted <- as.integer(training_listings$promoted)
test_subset$promoted <- as.integer(test_subset$promoted)

#Vi is 1:z_review, 2:response_rate, 3:reviews_per_month, 4:years_as_host, and 5:response_time
for (i in 1:5) {
  colnames(training_listings)[i+2] <- paste("V",i,sep="")
  colnames(test_subset)[i+2]       <- paste("V",i,sep="")
}

#display most important variables
boost.first_boost <- gbm(as.factor(promoted)~.-id, data = training_listings, distribution = "multinomial", n.trees = 250, shrinkage = 0.01, interaction.depth = 3)
summary(boost.first_boost)

##    var   rel.inf
## V1  V1 39.768982
## V3  V3 36.100083
## V2  V2 16.131854
## V5  V5  6.017694
## V4  V4  1.981387
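As an aside: since `promoted` is binary, gbm's `bernoulli` distribution is arguably a more natural fit than `multinomial`. A hedged alternative sketch, assuming the 0/1-coded `promoted` column and renamed predictors from above:

```r
# Binary formulation: bernoulli expects a numeric 0/1 response
boost_bern <- gbm(promoted ~ . - id, data = training_listings,
                  distribution = "bernoulli", n.trees = 250,
                  shrinkage = 0.01, interaction.depth = 3)

# type = "response" returns P(promoted = 1) directly; threshold at 0.5
bern_prob <- predict(boost_bern, newdata = test_subset,
                     n.trees = 250, type = "response")
table(bern_prob > 0.5, test_subset$promoted)
```

This avoids the `apply(..., which.max)` step needed to decode multinomial output, though for two classes the fitted models should be essentially equivalent.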
#seq(1000, 5000, by=250) has 17 elements, so preallocate 17 slots
correct_falses   = integer(17)
correct_trues    = integer(17)
incorrect_falses = integer(17)
incorrect_trues  = integer(17)
class_error_f    = double(17)
class_error_t    = double(17)

j = 1
for(i in seq(from=1000, to=5000, by=250)){
  boost.first_boost <- gbm(as.factor(promoted)~.-id, data = training_listings, distribution = "multinomial", n.trees = i, shrinkage = 0.01, interaction.depth = 3)
  boost_pred = predict(boost.first_boost, newdata = test_subset, n.trees = i, type="response")
  a = table(apply(boost_pred, 1, which.max), test_subset$promoted)
  correct_falses[j] = a[[1]]
  correct_trues[j]  = a[[4]]
  incorrect_falses[j]  = a[[3]]
  incorrect_trues[j]   = a[[2]]
  class_error_f[j] = incorrect_falses[j]/(incorrect_falses[j]+correct_falses[j])
  class_error_t[j] = incorrect_trues[j]/(incorrect_trues[j]+correct_trues[j])
  print("finished an iteration")
  j = j + 1
}
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
## [1] "finished an iteration"
formatted_class_errors <- tibble(value = class_error_f, value1 = class_error_t, value2 = seq(from=1000, to=5000, by=250))
#formatted_class_errors$value1[1] <- 1.00

ggplot(data=formatted_class_errors, aes(x=value2, y= value)) +
  geom_line() +
  scale_y_continuous(limits = c(0, .2))+
  xlab("n trees") +
  ylab("class error (false)")

ggplot(data=formatted_class_errors, aes(x=value2, y= value1)) +
  geom_line() +
  scale_y_continuous(limits = c(0, .5))+
  xlab("n trees") +
  ylab("class error (true)")

The graphs here are very similar to the ones we saw with random forests. The best rate at which we get negative predictions wrong is around .09, and the best rate at which we get positive predictions wrong is about .35. In the end, we were able to build two models with fairly high accuracy on the negative members (non-promotions) of our dataset and decent accuracy on the positive members. The best boosted model performed similarly to the best random forest, but took significantly longer to compute.

Closing Thoughts

This data analysis project was much more long-winded than it needed to be. In many cases, the search for important features in the data is far more fruitful, and leads to better modeling with fewer visualizations. Unfortunately, much of what was done here only led to the conclusion that certain features were likely not usable for modeling, and in the end the model did not come out as well as we would have hoped. However, many fundamental data science concepts were demonstrated here, along with relatively good coding practices, so I think that still makes this demonstration worthwhile.

Thanks for reading!